Chao GENG Bo LIU Shigetoshi NAKATAKE
In integrated circuit design at advanced technology nodes, layout density uniformity significantly influences manufacturability because of CMP variability. In analog design in particular, designers struggle to pass density checks because few useful tools exist. To tackle this issue, we focus on a transistor-array (TA)-style analog layout and propose a density optimization algorithm that is consistent with complicated design rules. Based on the TA style, we introduce a density-aware layout format to explicitly control the layout pattern density, and provide a mathematical optimization approach. Hence, a design flow incorporating our density optimization can drastically reduce the design time with fewer iterations. In a design case of an OPAMP layout in a 65 nm CMOS process, the results demonstrate that the proposed approach achieves more than a 48× speed-up compared with conventional manual layout, while showing good circuit performance in the post-layout simulation.
Cloud computing, a novel distributed paradigm that provides powerful computing capabilities, is often adopted by developers and researchers to execute complicated IoT applications such as complex workflows. In this scenario, it is fundamentally important to schedule and execute workflow applications effectively and efficiently by fully utilizing the advantages of the cloud (such as virtualization and elastic services). However, there is currently relatively little research on workflow scheduling in cloud environments, and existing work usually applies traditional methods directly to the cloud. Ignoring the features of the cloud raises two kinds of problems: (1) the traditional methods mainly focus on static resource provisioning, which wastes resources; (2) they usually ignore the performance fluctuation of virtual machines on the physical machines, which leads to errors in the estimated task execution time. To address these problems, a novel mechanism is proposed that estimates the probability distribution of subtask execution time based on background VM load series over physical machines. An elastic, performance-fluctuation-aware stochastic scheduling algorithm is also introduced in this paper. The experiments show that the proposed algorithm outperforms the existing algorithms in several metrics and relieves the influence of performance fluctuations brought by the dynamic nature of the cloud.
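As a rough illustration of estimating an execution-time distribution from a background load series, the following minimal sketch builds an empirical distribution and reads off a quantile that a scheduler could use as a conservative estimate. The slowdown model (execution time grows as `base_time / (1 - load)`), the load cap, and all function names are assumptions made here for illustration; the abstract does not specify the paper's actual model.

```python
# Hypothetical sketch: empirical execution-time distribution from a load series.
# The slowdown model and every name below are assumptions, not the paper's method.

MAX_LOAD = 0.95  # assumed cap so the slowdown stays finite


def exec_time_samples(base_time, load_series):
    """Map each observed background load (fraction in [0, 1)) to an execution time."""
    return [base_time / (1.0 - min(load, MAX_LOAD)) for load in load_series]


def empirical_quantile(samples, q):
    """Return the q-quantile (0 <= q <= 1) of the samples by sorting and indexing."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


if __name__ == "__main__":
    loads = [0.0, 0.5, 0.5, 0.75]           # illustrative background VM load samples
    samples = exec_time_samples(10.0, loads)
    print(samples)
    print(empirical_quantile(samples, 0.5))  # a median-style estimate
```

A scheduler that wants to be robust to fluctuation could plan with a high quantile instead of the mean, at the cost of more pessimistic completion-time estimates.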
Bing XU Shouyi YIN Leibo LIU Shaojun WEI
Coarse-Grained Reconfigurable Architectures (CGRAs) are a promising platform owing to their high performance and low cost. Researchers have developed efficient compilers for mapping compute-intensive applications onto CGRAs using modulo scheduling. To generate the loop kernel, every stage of the kernel is forced to have the same execution time, which is determined by the critical PE. Hence, non-critical PEs can lower their supply voltage according to their slack time. The variable dual-VDD CGRA incorporates this feature to reduce power consumption. Previous work mainly focuses on calculating a globally optimal VDDL using an overall optimization method, which does not fully exploit the flexibility of the architecture. In this brief, we adopt a variable optimal VDDL for each stage of the kernel according to its pattern, instead of a fixed, simulated globally optimal VDDL. Experiments show that the proposed heuristic approach reduces power by 27.6% on average without decreasing performance. The compilation time is also acceptable.
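The difference between one global VDDL and a per-stage VDDL can be sketched as follows. This is only an illustrative model: the alpha-power-law delay scaling, the quadratic energy model, the voltage values, and all function names are assumptions introduced here, not the heuristic from the brief.

```python
# Hypothetical sketch: global vs. per-stage VDDL selection in a dual-VDD kernel.
# Delay/energy models and all constants are assumptions for illustration only.

VDDH = 1.0   # assumed high supply voltage (normalized)
VTH = 0.3    # assumed threshold voltage
ALPHA = 1.3  # assumed alpha-power-law exponent


def delay_scale(v):
    """Delay at voltage v relative to the delay at VDDH (alpha-power law)."""
    reference = VDDH / (VDDH - VTH) ** ALPHA
    return (v / (v - VTH) ** ALPHA) / reference


def stage_energy(delays, vddl, period):
    """Relative dynamic energy of one stage; a PE uses vddl only if it still fits."""
    scale = delay_scale(vddl)
    energy = 0.0
    for d in delays:
        fits = d * scale <= period       # slack is large enough for the low voltage
        energy += (vddl if fits else VDDH) ** 2
    return energy


def global_vddl_energy(stages, candidates):
    """Best total energy when every stage shares one VDDL."""
    period = max(max(s) for s in stages)  # critical PE fixes the stage time
    return min(sum(stage_energy(s, v, period) for s in stages) for v in candidates)


def per_stage_vddl_energy(stages, candidates):
    """Total energy when each stage picks its own best VDDL."""
    period = max(max(s) for s in stages)
    return sum(min(stage_energy(s, v, period) for v in candidates) for s in stages)


if __name__ == "__main__":
    stages = [[1.0, 0.9, 0.5], [0.6, 0.55, 0.3], [0.95, 0.4, 0.4]]  # PE delays at VDDH
    candidates = [0.6, 0.7, 0.8, 0.9]
    print(global_vddl_energy(stages, candidates))
    print(per_stage_vddl_energy(stages, candidates))
```

Under this model the per-stage choice can never cost more energy than the global one, because the sum of per-stage minima is at most the minimum of the sums; how much it saves depends on how differently the slack is distributed across stages.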
Chongyong YIN Shouyi YIN Leibo LIU Shaojun WEI
The compiler is the most important supporting tool for facilitating the use of a reconfigurable computing architecture (RCA). In this paper, a template-based compiler framework is proposed. This compiler can synthesize executables for an RCA directly from source code written in a native high-level programming language. It supports generating run-time dynamic configuration contexts, and it is capable of generating both full and partial configuration contexts. Experimental results show that the executables generated by the proposed compiler achieve better execution performance and smaller configuration context size than those of previous compilers. Moreover, this compiler does not require the programmer to have any extra knowledge about the hardware architecture of the RCA.
Bo LIU Peng CAO Min ZHU Jun YANG Leibo LIU Shaojun WEI Longxing SHI
This paper presents a novel architecture design to optimize the reconfiguration process of a coarse-grained reconfigurable architecture (CGRA) called Reconfigurable Multimedia System II (REMUS-II). In REMUS-II, the tasks in multimedia applications are divided into two parts: computing-intensive tasks and control-intensive tasks. REMUS-II contains two Reconfigurable Processor Units (RPUs) for accelerating computing-intensive tasks and a Micro-Processor Unit (µPU) for accelerating control-intensive tasks. As a large-scale CGRA, REMUS-II can provide satisfactory solutions in terms of both efficiency and flexibility. This feature makes REMUS-II well suited for video processing, which demands high flexibility and involves many computation tasks. To meet the high dynamic reconfiguration performance required by multimedia applications, the reconfiguration architecture of REMUS-II must be well designed. To optimize it, a hierarchical configuration storage structure and a 3-stage reconfiguration processing structure are proposed. Furthermore, several optimization methods for configuration reuse are introduced to further improve the performance of the reconfiguration process. These methods cover two aspects: the multi-target reconfiguration method and the configuration caching strategies. Experimental results show that the proposed reconfiguration architecture improves the performance of the reconfiguration process by 4 times. Based on RTL simulation, REMUS-II can support 1080p@32 fps H.264 HiP@Level4 and 1080p@40 fps High-level MPEG-2 stream decoding at a clock frequency of 200 MHz. The proposed REMUS-II system has been implemented in a TSMC 65 nm process. The die size is 23.7 mm² and the estimated on-chip dynamic power is 620 mW.
Shouyi YIN Dajiang LIU Leibo LIU Shaojun WEI
A coarse-grained reconfigurable architecture (CGRA) is typically a hybrid architecture composed of a reconfigurable processing unit (RPU) and a host microprocessor. Computation-intensive kernels (e.g., loop nests) are often mapped onto RPUs to speed up the execution of programs. Thus, mapping optimization of loop nests is very important for improving the performance of a CGRA. Processing element (PE) utilization rate, communication volume and reconfiguration cost are three crucial factors for the performance of RPUs. Loop transformations can greatly affect these three factors and are therefore significant when mapping loops onto RPUs. In this paper, a joint loop transformation approach for RPUs is proposed, in which the PE utilization rate, communication cost and reconfiguration cost are considered jointly. Our approach could be integrated into compilers for CGRAs to improve operating performance. Experimental results show that, compared with the communication-minimal approach, our scheme improves execution time by 5.8% and 13.6% on motion estimation (ME) and partial differential equation (PDE) solver kernels, respectively. The run-time complexity is also acceptable for practical cases.